A Compressed Self-Index for Genomic Databases
Abstract
Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.
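The RLZ idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `rlz_parse` and `rlz_decode` are hypothetical, the longest match is found by a naive scan rather than through a suffix array or FM-index of the reference, and characters absent from the reference are stored as literals (an assumption made for this sketch).

```python
def rlz_parse(reference, target):
    """Greedily factor `target` into phrases copied from `reference`.

    Each phrase is (start, length), meaning reference[start:start+length],
    or (char, 0) for a character that never occurs in the reference
    (a convention assumed for this sketch).
    """
    phrases = []
    i = 0
    while i < len(target):
        best_start, best_len = -1, 0
        # Naive longest-match search over every reference position.
        # A practical index would query a suffix array or FM-index instead.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len == 0:
            phrases.append((target[i], 0))  # literal character
            i += 1
        else:
            phrases.append((best_start, best_len))
            i += best_len
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the target from the reference and its phrase list."""
    parts = []
    for first, length in phrases:
        if length == 0:
            parts.append(first)  # literal character
        else:
            parts.append(reference[first:first + length])
    return "".join(parts)


reference = "ACGTACGTTAGC"
target = "ACGTTAGGACGT"
phrases = rlz_parse(reference, target)
restored = rlz_decode(reference, phrases)
```

Because every phrase points only into the reference, a phrase can be decoded without touching any other phrase, which is what makes fast random access possible in this scheme.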
Similar resources
Self-Indexing Based on LZ77
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...
Self-Index Based on LZ77
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...
Combining Text Compression and String Matching: The Miracle of Self-Indexing
This decade has witnessed the rise of what I consider the most important breakthrough of modern times in text compression and indexed string matching. Self-indexing is the mechanism by which a text is simultaneously compressed and indexed, so that the self-index occupies space close to that of the compressed text, provides random access to any part of it, and in addition supports efficient inde...
A Faster Grammar-Based Self-index
To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in O(r + z log log n) space such that, given a pattern P[1..m], we can list the occ occurrences ...
Application of Fractal Codes as Similarity Measure for Compressed Image Databases
In image database applications, it is desirable that functions such as searching, browsing, and partial recall be done without totally decompressing the images. Using wavelet-compressed images is becoming increasingly popular. Image databases, and edge images derived from such compressed images can be viewed as indexes that can be queried by examples. In this research, a fractional code generat...
Journal title:
- CoRR
Volume abs/1111.1355
Publication date 2011